This is my Exam 3 document
Let's load the data and take a look at it.
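library(tidyverse); library(plotly); library(modelr) # ggplot2/dplyr, ggplotly(), add_predictions()
data <- read.csv(file = "BioLogData_Exam3.csv", sep = "|", stringsAsFactors = TRUE) # text columns as factors, matching the summary below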
summary(data)
##     Sample.ID         Rep          Well        Dilution
##  Clear_Creek:288   Min.   :1   A1     : 36   Min.   :0.001
##  Soil_1     :288   1st Qu.:1   A2     : 36   1st Qu.:0.001
##  Soil_2     :288   Median :2   A3     : 36   Median :0.010
##  Waste_Water:288   Mean   :2   A4     : 36   Mean   :0.037
##                    3rd Qu.:3   B1     : 36   3rd Qu.:0.100
##                    Max.   :3   B2     : 36   Max.   :0.100
##                                (Other):936
##             Substrate                  Hr_24            Hr_48
##  2-Hydroxy Benzoic Acid     : 36   Min.   :0.0000   Min.   :0.0000
##  4-Hydroxy Benzoic Acid     : 36   1st Qu.:0.0000   1st Qu.:0.0060
##  D-Cellobiose               : 36   Median :0.0320   Median :0.2595
##  D-Galactonic Acid γ-Lactone: 36   Mean   :0.1703   Mean   :0.4691
##  D-Galacturonic Acid        : 36   3rd Qu.:0.1872   3rd Qu.:0.7220
##  D-Glucosaminic Acid        : 36   Max.   :2.6500   Max.   :2.7850
##  (Other)                    :936
##      Hr_144
##  Min.   :0.00000
##  1st Qu.:0.04175
##  Median :0.75200
##  Mean   :0.92497
##  3rd Qu.:1.67950
##  Max.   :3.11600
##
Let's do some exploratory analysis.
pairs(data)
class(data$Sample.ID)
## [1] "factor"
class(data$Rep)
## [1] "integer"
class(data$Well)
## [1] "factor"
class(data$Dilution)
## [1] "numeric"
class(data$Substrate)
## [1] "factor"
class(data$Hr_24)
## [1] "numeric"
class(data$Hr_48)
## [1] "numeric"
class(data$Hr_144)
## [1] "numeric"
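The eight class() calls above can be collapsed into a single call. A minimal sketch, assuming a small made-up data frame (`toy` is hypothetical and only stands in for `data`):

```r
# Toy stand-in for `data`; the values are invented for illustration only.
toy <- data.frame(
  Sample.ID = factor(c("Soil_1", "Clear_Creek")),
  Rep       = c(1L, 2L),
  Dilution  = c(0.001, 0.1),
  Hr_24     = c(0.032, 0.000)
)

# sapply() applies class() to every column and returns a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)
```

On the real data the same call would be sapply(data, class); str(data) gives a similar overview along with the first few values of each column.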
Some regression models and summary statistics.
a<- lm(formula = Dilution ~ Hr_24, data = data)
summary(a)
##
## Call:
## lm(formula = Dilution ~ Hr_24, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03664 -0.03607 -0.02750 0.06237 0.06787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037644 0.001497 25.146 <2e-16 ***
## Hr_24 -0.003784 0.004173 -0.907 0.365
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04472 on 1150 degrees of freedom
## Multiple R-squared: 0.0007146, Adjusted R-squared: -0.0001544
## F-statistic: 0.8223 on 1 and 1150 DF, p-value: 0.3647
b<- lm(formula = Dilution ~ Hr_48, data = data)
summary(b)
##
## Call:
## lm(formula = Dilution ~ Hr_48, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03713 -0.03571 -0.02745 0.06198 0.06650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.038127 0.001710 22.296 <2e-16 ***
## Hr_48 -0.002403 0.002324 -1.034 0.301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04472 on 1150 degrees of freedom
## Multiple R-squared: 0.0009286, Adjusted R-squared: 5.981e-05
## F-statistic: 1.069 on 1 and 1150 DF, p-value: 0.3014
c<- lm(formula= Dilution ~ Hr_144, data = data)
summary(c)
##
## Call:
## lm(formula = Dilution ~ Hr_144, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04068 -0.03168 -0.02651 0.05956 0.07303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.041682 0.001923 21.68 < 2e-16 ***
## Hr_144 -0.005062 0.001520 -3.33 0.000896 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04452 on 1150 degrees of freedom
## Multiple R-squared: 0.00955, Adjusted R-squared: 0.008689
## F-statistic: 11.09 on 1 and 1150 DF, p-value: 0.0008963
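Of the three time points, only Hr_144 has a significant relationship with Dilution (p = 0.0009), although the model explains less than 1% of the variance (R-squared = 0.00955).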
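The three models above share the same structure, so they can be fit in a loop and compared side by side. This is only a sketch on random toy data (the `toy` data frame and the `comparison` table are hypothetical names, and the numbers it prints have no connection to the real results):

```r
set.seed(1)
# Toy stand-in for `data`; values are random and carry no biological meaning.
toy <- data.frame(
  Dilution = rep(c(0.001, 0.01, 0.1), each = 20),
  Hr_24    = runif(60),
  Hr_48    = runif(60),
  Hr_144   = runif(60)
)

time_points <- c("Hr_24", "Hr_48", "Hr_144")

# Fit Dilution ~ <time point> for each column; keep the slope's p-value and R-squared.
fits <- lapply(time_points, function(tp) {
  s <- summary(lm(reformulate(tp, response = "Dilution"), data = toy))
  data.frame(
    predictor = tp,
    p_value   = s$coefficients[tp, "Pr(>|t|)"],
    r_squared = s$r.squared
  )
})
comparison <- do.call(rbind, fits)
print(comparison)
```

Putting the p-value next to R-squared makes it easier to see when a predictor is statistically significant but explains very little of the variance.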
hist(data$Dilution)
hist(data$Hr_144)
hist(data$Hr_48)
hist(data$Hr_24)
names(data)
## [1] "Sample.ID" "Rep" "Well" "Dilution" "Substrate" "Hr_24"
## [7] "Hr_48" "Hr_144"
ggplot(data, aes(x = Dilution, y = Substrate)) +
  geom_boxplot() + facet_wrap(~Sample.ID)
fig1 <- ggplot(data, aes(x = Hr_24, fill = Sample.ID)) +
  geom_histogram()
fig2 <- ggplot(data, aes(x = Hr_24, fill = Substrate)) +
  geom_histogram()
ggplotly(fig1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig3 <- ggplot(data, aes(x = Hr_48, fill = Sample.ID)) +
  geom_histogram()
fig4 <- ggplot(data, aes(x = Hr_48, fill = Substrate)) +
  geom_histogram()
fig5 <- ggplot(data, aes(x = Hr_144, fill = Sample.ID)) +
  geom_histogram()
fig6 <- ggplot(data, aes(x = Hr_144, fill = Substrate)) +
  geom_histogram()
ggplotly(fig3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig6)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
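The `stat_bin()` message above appears whenever geom_histogram() has to fall back on its default of 30 bins. A minimal sketch of setting the bin width explicitly, assuming toy data and an arbitrary width of 0.1 (both are illustrative choices, not values taken from the analysis):

```r
library(ggplot2)

# Toy stand-in for `data`; values are random, for illustration only.
set.seed(1)
toy <- data.frame(
  Sample.ID = rep(c("Soil_1", "Clear_Creek"), each = 50),
  Hr_24     = runif(100, min = 0, max = 2.65)
)

# Bare column names inside aes() keep fill and facets aligned with x.
# binwidth = 0.1 is an arbitrary choice; setting it silences the stat_bin() message.
fig <- ggplot(toy, aes(x = Hr_24, fill = Sample.ID)) +
  geom_histogram(binwidth = 0.1)
print(fig)
```

The resulting object can be passed to ggplotly() in the same way as fig1 through fig6.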
Which sample locations are functionally different from each other in terms of what C-substrates they can utilize?
levels(data$Sample.ID)
## [1] "Clear_Creek" "Soil_1" "Soil_2" "Waste_Water"
levels(data$Substrate)
## [1] "2-Hydroxy Benzoic Acid" "4-Hydroxy Benzoic Acid"
## [3] "D-Cellobiose" "D-Galactonic Acid γ-Lactone"
## [5] "D-Galacturonic Acid" "D-Glucosaminic Acid"
## [7] "D-Mallic Acid" "D-Mannitol"
## [9] "D-Xylose" "D.L -α-Glycerol Phosphate"
## [11] "Glucose-1-Phosphate" "Glycogen"
## [13] "Glycyl-L-Glutamic Acid" "i-Erythitol"
## [15] "Itaconic Acid" "L-Arginine"
## [17] "L-Asparganine" "L-Phenylalanine"
## [19] "L-Serine" "L-Threonine"
## [21] "N-Acetyl-D-Glucosamine" "Phenylethylamine"
## [23] "Putrescine" "Pyruvic Acid Methyl Ester"
## [25] "Tween 40" "Tween 80 "
## [27] "Water" "α-Cyclodextrin"
## [29] "α-D-Lactose" "α-Ketobutyric Acid"
## [31] "β-Methyl-D- Glucoside" "γ-Hydroxybutyric Acid"
It looks like the soil samples differ most from the Waste Water and Clear Creek samples because they utilize more carbon substrates.
Are Soil and Water samples significantly different overall (as in, overall diversity of usable carbon sources)? What about for individual carbon substrates?
# creek, wastewater, soil1 and soil2 are per-sample subsets created in a step not shown
creek <- creek %>%
  mutate(diversity = "water")
wastewater <- wastewater %>%
  mutate(diversity = "water")
soil1 <- soil1 %>%
  mutate(diversity = "soil")
soil2 <- soil2 %>%
  mutate(diversity = "soil")
data <- rbind(creek, wastewater, soil1, soil2)
mod1 <- lm(data = data, values ~ Substrate * diversity)
summary(mod1)
##
## Call:
## lm(formula = values ~ Substrate * diversity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3675 -0.3568 -0.1332 0.1984 2.6406
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.609204 0.082725
## Substrate4-Hydroxy Benzoic Acid 0.412444 0.116991
## SubstrateD-Cellobiose 0.409778 0.116991
## SubstrateD-Galactonic Acid γ-Lactone 0.116667 0.116991
## SubstrateD-Galacturonic Acid 0.382444 0.116991
## SubstrateD-Glucosaminic Acid 0.234093 0.116991
## SubstrateD-Mallic Acid 0.009667 0.116991
## SubstrateD-Mannitol 0.631981 0.116991
## SubstrateD-Xylose 0.333370 0.116991
## SubstrateD.L -α-Glycerol Phosphate -0.420556 0.116991
## SubstrateGlucose-1-Phosphate 0.015981 0.116991
## SubstrateGlycogen 0.242000 0.116991
## SubstrateGlycyl-L-Glutamic Acid 0.046333 0.116991
## Substratei-Erythitol 0.022241 0.116991
## SubstrateItaconic Acid 0.292037 0.116991
## SubstrateL-Arginine 0.432741 0.116991
## SubstrateL-Asparganine 0.758278 0.116991
## SubstrateL-Phenylalanine 0.172370 0.116991
## SubstrateL-Serine 0.581315 0.116991
## SubstrateL-Threonine 0.061593 0.116991
## SubstrateN-Acetyl-D-Glucosamine 0.595148 0.116991
## SubstratePhenylethylamine 0.134333 0.116991
## SubstratePutrescine -0.017556 0.116991
## SubstratePyruvic Acid Methyl Ester 0.391852 0.116991
## SubstrateTween 40 0.296056 0.116991
## SubstrateTween 80 0.332741 0.116991
## SubstrateWater -0.609204 0.116991
## Substrateα-Cyclodextrin 0.042630 0.116991
## Substrateα-D-Lactose -0.074815 0.116991
## Substrateα-Ketobutyric Acid -0.093981 0.116991
## Substrateβ-Methyl-D- Glucoside 0.056759 0.116991
## Substrateγ-Hydroxybutyric Acid -0.213148 0.116991
## diversitywater -0.561426 0.116991
## Substrate4-Hydroxy Benzoic Acid:diversitywater -0.210352 0.165450
## SubstrateD-Cellobiose:diversitywater 0.051815 0.165450
## SubstrateD-Galactonic Acid γ-Lactone:diversitywater 0.101000 0.165450
## SubstrateD-Galacturonic Acid:diversitywater -0.017000 0.165450
## SubstrateD-Glucosaminic Acid:diversitywater -0.127370 0.165450
## SubstrateD-Mallic Acid:diversitywater 0.107093 0.165450
## SubstrateD-Mannitol:diversitywater -0.134852 0.165450
## SubstrateD-Xylose:diversitywater -0.332352 0.165450
## SubstrateD.L -α-Glycerol Phosphate:diversitywater 0.469278 0.165450
## SubstrateGlucose-1-Phosphate:diversitywater 0.198574 0.165450
## SubstrateGlycogen:diversitywater 0.163907 0.165450
## SubstrateGlycyl-L-Glutamic Acid:diversitywater 0.108796 0.165450
## Substratei-Erythitol:diversitywater 0.160963 0.165450
## SubstrateItaconic Acid:diversitywater -0.168259 0.165450
## SubstrateL-Arginine:diversitywater -0.238370 0.165450
## SubstrateL-Asparganine:diversitywater -0.404093 0.165450
## SubstrateL-Phenylalanine:diversitywater -0.018630 0.165450
## SubstrateL-Serine:diversitywater -0.310259 0.165450
## SubstrateL-Threonine:diversitywater 0.133130 0.165450
## SubstrateN-Acetyl-D-Glucosamine:diversitywater -0.037500 0.165450
## SubstratePhenylethylamine:diversitywater -0.006741 0.165450
## SubstratePutrescine:diversitywater 0.118000 0.165450
## SubstratePyruvic Acid Methyl Ester:diversitywater -0.089815 0.165450
## SubstrateTween 40:diversitywater -0.120519 0.165450
## SubstrateTween 80 :diversitywater 0.112963 0.165450
## SubstrateWater:diversitywater 0.561426 0.165450
## Substrateα-Cyclodextrin:diversitywater 0.093185 0.165450
## Substrateα-D-Lactose:diversitywater 0.277833 0.165450
## Substrateα-Ketobutyric Acid:diversitywater 0.079000 0.165450
## Substrateβ-Methyl-D- Glucoside:diversitywater 0.252259 0.165450
## Substrateγ-Hydroxybutyric Acid:diversitywater 0.425648 0.165450
## t value Pr(>|t|)
## (Intercept) 7.364 2.23e-13 ***
## Substrate4-Hydroxy Benzoic Acid 3.525 0.000428 ***
## SubstrateD-Cellobiose 3.503 0.000467 ***
## SubstrateD-Galactonic Acid γ-Lactone 0.997 0.318723
## SubstrateD-Galacturonic Acid 3.269 0.001090 **
## SubstrateD-Glucosaminic Acid 2.001 0.045477 *
## SubstrateD-Mallic Acid 0.083 0.934152
## SubstrateD-Mannitol 5.402 7.04e-08 ***
## SubstrateD-Xylose 2.850 0.004405 **
## SubstrateD.L -α-Glycerol Phosphate -3.595 0.000329 ***
## SubstrateGlucose-1-Phosphate 0.137 0.891351
## SubstrateGlycogen 2.069 0.038665 *
## SubstrateGlycyl-L-Glutamic Acid 0.396 0.692098
## Substratei-Erythitol 0.190 0.849237
## SubstrateItaconic Acid 2.496 0.012599 *
## SubstrateL-Arginine 3.699 0.000220 ***
## SubstrateL-Asparganine 6.482 1.04e-10 ***
## SubstrateL-Phenylalanine 1.473 0.140744
## SubstrateL-Serine 4.969 7.07e-07 ***
## SubstrateL-Threonine 0.526 0.598593
## SubstrateN-Acetyl-D-Glucosamine 5.087 3.83e-07 ***
## SubstratePhenylethylamine 1.148 0.250950
## SubstratePutrescine -0.150 0.880727
## SubstratePyruvic Acid Methyl Ester 3.349 0.000819 ***
## SubstrateTween 40 2.531 0.011432 *
## SubstrateTween 80 2.844 0.004479 **
## SubstrateWater -5.207 2.03e-07 ***
## Substrateα-Cyclodextrin 0.364 0.715593
## Substrateα-D-Lactose -0.639 0.522544
## Substrateα-Ketobutyric Acid -0.803 0.421843
## Substrateβ-Methyl-D- Glucoside 0.485 0.627593
## Substrateγ-Hydroxybutyric Acid -1.822 0.068554 .
## diversitywater -4.799 1.66e-06 ***
## Substrate4-Hydroxy Benzoic Acid:diversitywater -1.271 0.203675
## SubstrateD-Cellobiose:diversitywater 0.313 0.754166
## SubstrateD-Galactonic Acid γ-Lactone:diversitywater 0.610 0.541600
## SubstrateD-Galacturonic Acid:diversitywater -0.103 0.918167
## SubstrateD-Glucosaminic Acid:diversitywater -0.770 0.441446
## SubstrateD-Mallic Acid:diversitywater 0.647 0.517493
## SubstrateD-Mannitol:diversitywater -0.815 0.415093
## SubstrateD-Xylose:diversitywater -2.009 0.044640 *
## SubstrateD.L -α-Glycerol Phosphate:diversitywater 2.836 0.004590 **
## SubstrateGlucose-1-Phosphate:diversitywater 1.200 0.230142
## SubstrateGlycogen:diversitywater 0.991 0.321913
## SubstrateGlycyl-L-Glutamic Acid:diversitywater 0.658 0.510853
## Substratei-Erythitol:diversitywater 0.973 0.330681
## SubstrateItaconic Acid:diversitywater -1.017 0.309235
## SubstrateL-Arginine:diversitywater -1.441 0.149750
## SubstrateL-Asparganine:diversitywater -2.442 0.014641 *
## SubstrateL-Phenylalanine:diversitywater -0.113 0.910354
## SubstrateL-Serine:diversitywater -1.875 0.060844 .
## SubstrateL-Threonine:diversitywater 0.805 0.421076
## SubstrateN-Acetyl-D-Glucosamine:diversitywater -0.227 0.820706
## SubstratePhenylethylamine:diversitywater -0.041 0.967504
## SubstratePutrescine:diversitywater 0.713 0.475766
## SubstratePyruvic Acid Methyl Ester:diversitywater -0.543 0.587267
## SubstrateTween 40:diversitywater -0.728 0.466400
## SubstrateTween 80 :diversitywater 0.683 0.494803
## SubstrateWater:diversitywater 3.393 0.000698 ***
## Substrateα-Cyclodextrin:diversitywater 0.563 0.573320
## Substrateα-D-Lactose:diversitywater 1.679 0.093193 .
## Substrateα-Ketobutyric Acid:diversitywater 0.477 0.633046
## Substrateβ-Methyl-D- Glucoside:diversitywater 1.525 0.127430
## Substrateγ-Hydroxybutyric Acid:diversitywater 2.573 0.010134 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6079 on 3392 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.2381
## F-statistic: 18.14 on 63 and 3392 DF, p-value: < 2.2e-16
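The objects creek, wastewater, soil1 and soil2, and the values column used in the model above, come from a step that is not shown. One plausible way to build them is to stack the three reading columns into long format and then label each sample. The sketch below is an assumption about that step, run on made-up toy data; only the column names inc_hrs, values and diversity are taken from the output in this document:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw `data`; values are invented for illustration only.
toy <- data.frame(
  Sample.ID = c("Clear_Creek", "Soil_1", "Soil_2", "Waste_Water"),
  Substrate = "D-Xylose",
  Hr_24     = c(0, 0.1, 0.2, 0),
  Hr_48     = c(0, 0.3, 0.4, 0.1),
  Hr_144    = c(0, 0.9, 1.1, 0.2)
)

long <- toy %>%
  # Stack the three reading columns into one `values` column, keyed by `inc_hrs`.
  pivot_longer(cols = c(Hr_24, Hr_48, Hr_144),
               names_to = "inc_hrs", values_to = "values") %>%
  # Label each row as soil or water based on its sample name.
  mutate(diversity = ifelse(Sample.ID %in% c("Soil_1", "Soil_2"), "soil", "water"))

# One subset per sample, assuming this is how `creek` and the others were derived.
creek <- filter(long, Sample.ID == "Clear_Creek")
print(long)
```

Labelling diversity in a single mutate() on the long table avoids splitting the data into four objects and binding them back together.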
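If there are differences between samples, which C-substrates are driving those differences?
Yes, there are differences. They can be identified from the significant terms in the model above: the diversitywater coefficient (-0.561, p = 1.66e-06) shows lower readings in water than in soil for the reference substrate (2-Hydroxy Benzoic Acid), and the Substrate:diversity interactions with p < 0.05 (D-Xylose, D.L -α-Glycerol Phosphate, L-Asparganine, γ-Hydroxybutyric Acid and the Water control) mark the substrates whose soil-water gap differs from that of the reference.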
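Rather than reading a long coefficient table by eye, the significant interaction terms can be pulled out programmatically. A minimal sketch on random toy data (the `toy` data frame and the 0.05 cutoff are illustrative assumptions, and random data will often yield no significant rows):

```r
set.seed(1)
# Toy stand-in for the long-format data; values are random, for illustration only.
toy <- expand.grid(
  Substrate = c("D-Xylose", "Glycogen", "Water"),
  diversity = c("soil", "water"),
  rep       = 1:10
)
toy$values <- runif(nrow(toy))

mod <- lm(values ~ Substrate * diversity, data = toy)
coefs <- summary(mod)$coefficients

# Interaction rows are the ones whose names contain a colon.
interactions <- coefs[grepl(":", rownames(coefs)), , drop = FALSE]

# Keep the interaction terms below the 0.05 threshold (may be empty for random data).
significant <- interactions[interactions[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(significant)
```

Each interaction coefficient is measured relative to the reference levels of both factors, so a significant row means that substrate's soil-water gap differs from the gap for the reference substrate.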
Does the dilution factor change any of these answers?
Let's take a look and make some more models.
## Df Sum Sq Mean Sq F value Pr(>F)
## creek$Dilution 1 16.75 16.748 109.7 <2e-16 ***
## Residuals 862 131.64 0.153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## soil1$Dilution 1 21.4 21.41 35.13 4.46e-09 ***
## Residuals 862 525.4 0.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## soil2$Dilution 1 54.0 53.99 105.8 <2e-16 ***
## Residuals 862 439.8 0.51
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## wastewater$Dilution 1 12.74 12.740 47.74 9.44e-12 ***
## Residuals 862 230.01 0.267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
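The code behind the four tables above is not shown. The term labels (for example creek$Dilution) suggest one aov() fit per sample with Dilution as the only predictor. A sketch of that idea, assuming long-format toy data with a values column (all names and numbers here are illustrative):

```r
set.seed(1)
# Toy stand-in for the long-format data; values are random, for illustration only.
toy <- expand.grid(
  Sample.ID = c("Clear_Creek", "Soil_1"),
  Dilution  = c(0.001, 0.01, 0.1),
  rep       = 1:10
)
toy$values <- runif(nrow(toy))

# Fit values ~ Dilution separately within each sample and keep each ANOVA table.
anovas <- lapply(split(toy, toy$Sample.ID), function(d) {
  summary(aov(values ~ Dilution, data = d))
})
print(anovas)
```

Because Dilution is numeric, each fit spends a single degree of freedom on it, which matches the Df of 1 in the tables above.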
Now let's add some predictions to a chosen model.
# mod5 is one of the models fit in a step not shown
data <- add_predictions(data = data, model = mod5)
p <- ggplot(data, aes(x = Substrate, y = values)) +
  geom_point() + facet_wrap(~diversity) +
  geom_point(aes(y = pred), col = "orange") + # model predictions
  theme(axis.text.x = element_text(angle = 90))
ggplotly(p)
Less carbon is consumed as the concentration of the soil samples increases.
More carbon is consumed as the concentration of the water samples increases.
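One way to check the direction of these trends is to compare the sign of the Dilution slope within each group. This is a sketch on random toy data, so the slopes it prints say nothing about the real samples; `toy` and `slopes` are hypothetical names:

```r
set.seed(1)
# Toy stand-in for the long-format data; values are random, for illustration only.
toy <- expand.grid(
  diversity = c("soil", "water"),
  Dilution  = c(0.001, 0.01, 0.1),
  rep       = 1:10
)
toy$values <- runif(nrow(toy))

# A negative slope means readings fall as Dilution rises; a positive slope means they rise.
slopes <- sapply(split(toy, toy$diversity), function(d) {
  coef(lm(values ~ Dilution, data = d))[["Dilution"]]
})
print(slopes)
```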
Do the control samples indicate any contamination?
Since water is a negative control, any BioLog reading other than 0 in a water well would suggest contamination.
## # A tibble: 108 x 9
## Sample.ID Rep Well Dilution Substrate inc_hrs values diversity pred
## <fct> <int> <fct> <dbl> <fct> <chr> <dbl> <chr> <dbl>
## 1 Clear_Creek 1 A1 0.001 Water Hr_144 0 water 2.90e-14
## 2 Clear_Creek 1 A1 0.001 Water Hr_48 0 water 2.90e-14
## 3 Clear_Creek 1 A1 0.001 Water Hr_24 0 water 2.90e-14
## 4 Clear_Creek 1 A1 0.01 Water Hr_144 0 water 2.90e-14
## 5 Clear_Creek 1 A1 0.01 Water Hr_48 0 water 2.90e-14
## 6 Clear_Creek 1 A1 0.01 Water Hr_24 0 water 2.90e-14
## 7 Clear_Creek 1 A1 0.1 Water Hr_144 0 water 2.90e-14
## 8 Clear_Creek 1 A1 0.1 Water Hr_48 0 water 2.90e-14
## 9 Clear_Creek 1 A1 0.1 Water Hr_24 0 water 2.90e-14
## 10 Clear_Creek 2 A1 0.001 Water Hr_144 0 water 2.90e-14
## # … with 98 more rows
Looking at the values for Hr_24, Hr_48, and Hr_144, we can see that they are all 0, which tells us that there was no contamination.
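The printed tibble shows only the first 10 of 108 rows, so a programmatic check over every control row is more reliable than reading the preview. A minimal sketch, assuming long-format toy data with Substrate and values columns (the toy values are invented):

```r
# Toy stand-in for the long-format data; values are invented for illustration only.
toy <- data.frame(
  Substrate = c("Water", "Water", "Water", "D-Xylose"),
  inc_hrs   = c("Hr_24", "Hr_48", "Hr_144", "Hr_24"),
  values    = c(0, 0, 0, 0.42)
)

# Keep only the negative-control wells.
controls <- toy[toy$Substrate == "Water", ]

# TRUE only if every control reading is exactly zero.
no_contamination <- all(controls$values == 0)

# Rows that would need a closer look (empty when the controls are clean).
suspect <- controls[controls$values != 0, ]
print(no_contamination)
```

On the real data, an empty `suspect` table would confirm the conclusion above for all 108 control readings.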